We investigate a dataset of 102 bulk RNA-seq samples extracted from the prefrontal cortex of 102 deceased individuals (overview). The dataset is provided by H Rehrauer of FGCZ (details). Roughly half of the samples are from patients diagnosed with ALS, the remaining ones are referred to as the control group. We attempt to infer cell type proportions hidden in each bulk RNA-seq sample. This procedure, known as cell type deconvolution (overview), should be integral to inference from the sample to disease condition. We focus here on reference-based deconvolution: in essence, if \(X\) is a bulk RNA-seq count vector, and \(a, b, \ldots\) are single-cell RNA-seq count vectors of representatives of relevant cells types, one aims to find non-negative coefficients \(\alpha, \beta, \ldots\) such that \(X \approx \alpha a + \beta b + \ldots\). Unfortunately, this reconstruction is confounded by batch effects such as difference in sequencing platforms, in particular different statistical properties of the technical noise of bulk versus single-cell RNA-seq.
As cell type reference datasets we resort to single-cell RNA-seq dataset Darmanis et al (~300 labeled cells used) and single-nucleus RNA-seq dataset Allen Brain M1 (>70k labeled cells). The two are compared here.
We apply the following packages/algorithms:
Bisque (R) prior to estimating cell type proportions transforms the bulk gene expression gene-wise to match the distribution from the reference dataset. It expects the reference single cells to come from several individuals. Links: paper, vignette, manual, github, ReferenceBasedDecomposition (uses limSolve).
MuSiC (R) “up-weighs genes with low cross-subject variance” and vice versa, and recursively estimates the proportion of clusters of similar cell types using genes of low within-cluster variance. Links: paper, tutorial, github, music_prop.
NNLS is a barebones Python implementation of cell type proportion reconstruction using non-negative least squares. Links: code.
RNA-Sieve (Python/Mathematica) incorporates the dependence of the observation noise on cell type composition of the bulk sample into the maximum likelihood estimator, and can also provide confidence intervals. Links: preprint, github, code.
DWLS (R) identifies differentially expressed genes as markers and averages those for each cell type. The absolute square residual in the usual NLLS is replaced by a relative one, requiring an iterative procedure to find a minimizer. Links: paper, github, bitbucket, quickstart, manual.
This script: Source code, output folder.
suppressPackageStartupMessages({
requireNamespace("magrittr")
library(dplyr)
library(ggplot2)
requireNamespace("pheatmap")
requireNamespace("stringr")
requireNamespace("reshape2")
requireNamespace("purrr")
requireNamespace("pbapply")
# avoid importing `exprs` that leads to clashes
requireNamespace("rlang")
# https://vroom.r-lib.org/articles/vroom.html
# install.packages("vroom")
requireNamespace("vroom")
# install.packages("pathlibr")
requireNamespace("pathlibr")
# Provides `ExpressionSet` data structure
# BiocManager::install("Biobase")
requireNamespace("Biobase")
# Deconvolution package
# install.packages("BisqueRNA")
requireNamespace("BisqueRNA")
# According to https://github.com/meichendong/SCDC :
# install.packages("remotes")
# remotes::install_github("renozao/xbioc")
requireNamespace("xbioc") # For MuSiC and such
# https://xuranw.github.io/MuSiC/articles/MuSiC.html
# install.packages("devtools")
# devtools::install_github("xuranw/MuSiC")
requireNamespace("MuSiC")
# BiocManager::install("MAST")
# remotes::install_bitbucket("yuanlab/dwls")
requireNamespace("ROCR")
requireNamespace("MAST")
requireNamespace("DWLS")
# install.packages("kableExtra")
requireNamespace("kableExtra")
})
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=lzh_TW
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=lzh_TW LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=lzh_TW LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=lzh_TW LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2 dplyr_1.0.2
##
## loaded via a namespace (and not attached):
## [1] backports_1.2.1 plyr_1.8.6
## [3] igraph_1.2.6 lazyeval_0.2.2
## [5] splines_3.6.3 BiocParallel_1.20.1
## [7] listenv_0.8.0 scattermore_0.7
## [9] GenomeInfoDb_1.22.1 digest_0.6.27
## [11] htmltools_0.5.0 magrittr_2.0.1
## [13] checkmate_2.0.0 memoise_1.1.0
## [15] tensor_1.5 cluster_2.1.0
## [17] ROCR_1.0-11 globals_0.14.0
## [19] matrixStats_0.57.0 vroom_1.3.2
## [21] MCMCpack_1.4-9 BisqueRNA_1.0.4
## [23] colorspace_2.0-0 rvest_0.3.6
## [25] blob_1.2.1 ggrepel_0.9.0
## [27] xfun_0.19 crayon_1.3.4
## [29] RCurl_1.98-1.2 jsonlite_1.7.2
## [31] spatstat.data_1.7-0 spatstat_1.64-1
## [33] MuSiC_0.1.1 survival_3.1-8
## [35] zoo_1.8-8 glue_1.4.2
## [37] kableExtra_1.3.1 polyclip_1.10-0
## [39] registry_0.5-1 gtable_0.3.0
## [41] nnls_1.4 zlibbioc_1.32.0
## [43] XVector_0.26.0 webshot_0.5.2
## [45] MatrixModels_0.4-1 leiden_0.3.6
## [47] DelayedArray_0.12.3 future.apply_1.6.0
## [49] SingleCellExperiment_1.8.0 BiocGenerics_0.32.0
## [51] abind_1.4-5 SparseM_1.78
## [53] scales_1.1.1 pheatmap_1.0.12
## [55] DBI_1.1.0 miniUI_0.1.1.1
## [57] Rcpp_1.0.5 DWLS_0.1
## [59] viridisLite_0.3.0 xtable_1.8-4
## [61] reticulate_1.18 bit_4.0.4
## [63] rsvd_1.0.3 stats4_3.6.3
## [65] htmlwidgets_1.5.3 httr_1.4.2
## [67] RColorBrewer_1.1-2 ellipsis_0.3.1
## [69] Seurat_3.2.3 ica_1.0-2
## [71] reshape_0.8.8 pkgconfig_2.0.3
## [73] uwot_0.1.10 deldir_0.2-3
## [75] tidyselect_1.1.0 rlang_0.4.9
## [77] reshape2_1.4.4 later_1.1.0.1
## [79] AnnotationDbi_1.48.0 munsell_0.5.0
## [81] tools_3.6.3 generics_0.1.0
## [83] RSQLite_2.2.1 xbioc_0.1.19
## [85] ggridges_0.5.2 evaluate_0.14
## [87] stringr_1.4.0 fastmap_1.0.1
## [89] goftest_1.2-2 yaml_2.2.1
## [91] mcmc_0.9-7 knitr_1.30
## [93] bit64_4.0.5 fitdistrplus_1.1-3
## [95] purrr_0.3.4 RANN_2.6.1
## [97] nlme_3.1-144 pbapply_1.4-3
## [99] future_1.21.0 mime_0.9
## [101] quantreg_5.75 xml2_1.3.2
## [103] rstudioapi_0.13 compiler_3.6.3
## [105] plotly_4.9.2.2 png_0.1-7
## [107] e1071_1.7-4 spatstat.utils_1.17-0
## [109] tibble_3.0.4 stringi_1.5.3
## [111] lattice_0.20-40 Matrix_1.2-18
## [113] vctrs_0.3.6 pillar_1.4.7
## [115] lifecycle_0.2.0 BiocManager_1.30.10
## [117] lmtest_0.9-38 RcppAnnoy_0.0.18
## [119] data.table_1.13.4 cowplot_1.1.0
## [121] bitops_1.0-6 irlba_2.3.3
## [123] conquer_1.0.2 httpuv_1.5.4
## [125] patchwork_1.1.1 GenomicRanges_1.38.0
## [127] R6_2.5.0 promises_1.1.1
## [129] gridExtra_2.3 KernSmooth_2.23-16
## [131] IRanges_2.20.2 parallelly_1.22.0
## [133] codetools_0.2-16 MASS_7.3-51.5
## [135] assertthat_0.2.1 SummarizedExperiment_1.16.1
## [137] MAST_1.12.0 pkgmaker_0.32.2
## [139] withr_2.3.0 sctransform_0.3.2
## [141] S4Vectors_0.24.4 GenomeInfoDbData_1.2.2
## [143] mgcv_1.8-31 parallel_3.6.3
## [145] rpart_4.1-15 quadprog_1.5-8
## [147] grid_3.6.3 pathlibr_0.1.0
## [149] tidyr_1.1.2 coda_0.19-4
## [151] class_7.3-15 rmarkdown_2.6
## [153] Rtsne_0.15 Biobase_2.46.0
## [155] shiny_1.5.0
BASEPATH <- pathlibr::Path$new(".")
stopifnot("deconvolution.Rmd" %in% names(BASEPATH$dir))
out_file <- (function(.) BASEPATH$join("output")$join(.)$show)
dir.create(out_file(""), showWarnings = FALSE)
path_to <- (function(.) Sys.glob(BASEPATH$join("../../..")$join(.)$show))
set.seed(43)
# Use first column as index
by_col1 <- (function(.) tibble::column_to_rownames(., colnames(.)[1]))
# Use index as new column `name`
ind2col <- (function(., name) tibble::rownames_to_column(., var = name))
group_by_and_sum <-
function(., column_name) {
(.) %>%
dplyr::group_by(!!rlang::sym(column_name)) %>%
dplyr::summarise_all(sum) %>%
tibble::column_to_rownames(column_name)
}
fix_names <- function(.) {
(.) %>%
# https://stackoverflow.com/a/55184433
rlang::set_names(stringr::str_replace(names(.), "cell type", "celltype"))
}
col_norm2 <- (function(.) t(t(.) / sqrt(colSums((.)**2))))
The function utils::read.delim is too slow to read wide tables (Allen Brain M1 has >70k cells).
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 32)
from_csv <- function(.) {
vroom::vroom(., del = "\t") %>%
by_col1() %>%
suppressMessages()
}
A call to to_csv should be preceded by ind2col("name for index") to write the row names.
to_csv <- function(., f) {
vroom::vroom_write(., out_file(f), delim = "\t")
}
hush <- base::invisible
ggplot2::theme_set(theme_light(base_size = 15))
kable <- function(.) {
kableExtra::kbl(., align = "c") %>%
kableExtra::kable_paper("hover", full_width = F)
}
# Wide plots (inches?)
WIDTH1 <- 20
HEIGHT1 <- 3
hist_theme <-
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.ticks.x = element_blank(),
)
save_barchart <-
function(., filename) {
as.data.frame(.) %>%
ind2col("celltype") %>%
reshape2::melt(
id = "celltype",
var = "sample",
value.name = "y"
) %>%
{
ggplot(., aes(x = sample, y = y, fill = celltype)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Paired") +
ylim(0, 1) +
hist_theme +
theme(
legend.title = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_text(
a = -90,
vjust = 0.5,
hjust = 0,
size = 10,
)
) +
ggsave(
filename = out_file(filename),
width = WIDTH1,
height = HEIGHT1,
device = "png"
)
} %>%
suppressWarnings()
return(.)
}
save_heatmap <-
function(., filename) {
as.data.frame(.) %>%
pheatmap::pheatmap(
filename = out_file(filename),
cluster_rows = FALSE,
cluster_cols = FALSE,
fontsize_row = 14,
fontsize_col = 10,
width = WIDTH1,
height = HEIGHT1
)
return(.)
}
fgcz_data <-
path_to("data/20201128-FGCZ/*count.zip") %>%
unz(., unzip(., list = TRUE)$Name) %>%
from_csv() %>%
# Collapse ENSG IDs by gene_name:
group_by_and_sum("gene_name")
fgcz_meta <- from_csv(path_to("data/20201128-FGCZ/*infos.tsv"))
This will be used later.
assert_fgcz_meta_order <-
function(.) {
stopifnot(all(rownames(.) == rownames(fgcz_meta)))
return(.)
}
t(fgcz_data) %>%
assert_fgcz_meta_order() %>%
hush()
darm_data <- from_csv(path_to("data/*/2015-Darmanis/b*/data.csv.gz"))
darm_meta <- from_csv(path_to("data/*/2015-Darmanis/b*/meta.csv.gz")) %>%
fix_names()
exclude_celltypes <- c("fetal_quiescent", "fetal_replicating", "hybrid")
Plot cell type counts.
darm_meta$celltype %>%
data.frame(x = .) %>%
ggplot(aes(x = x, fill = if_else(x %in% exclude_celltypes, "drop", "keep"))) +
geom_bar() +
ggtitle("Cell types in the 'Darmanis' reference dataset") +
scale_y_log10() +
hist_theme +
theme(axis.title.x = element_blank()) +
theme(axis.text.x = element_text(a = 45, hjust = 1)) +
theme(legend.title = element_blank())
Remove unnecessary cell types from the scRNA reference dataset.
darm_meta <-
darm_meta %>%
dplyr::filter(!(celltype %in% exclude_celltypes))
darm_data <-
darm_data %>%
dplyr::select_if(names(.) %in% rownames(darm_meta))
stopifnot(285 == nrow(darm_meta))
stopifnot(285 == ncol(darm_data))
abm1_meta <- from_csv(path_to("data/*/2019-AllenBrain-M1/b*/meta.csv*"))
abm1_data <- from_csv(path_to("data/*/2019-AllenBrain-M1/b*/data.csv*"))
Some sanity checks.
# Number of single cells
stopifnot(ncol(abm1_data) > 70000)
stopifnot(nrow(abm1_meta) > 70000)
# Number of genes and some examples
stopifnot(nrow(abm1_data) == 141)
stopifnot(all(c("PIK3CD", "WNT4", "LDLRAP1") %in% rownames(abm1_data)))
Drop samples with too little expression, otherwise Bisque doesn’t go through.
abm1_data <-
abm1_data %>%
# Keep only genes common with the FGCZ dataset
{
(.)[rownames(.) %in% rownames(fgcz_data), ]
} %>%
# Drop zero columns (dplyr variant is too slow here)
{
(.)[, colSums(.) != 0]
}
abm1_meta <-
abm1_meta %>%
dplyr::filter(rownames(.) %in% colnames(abm1_data))
Plot cell type counts.
data.frame(
x = abm1_meta$celltype,
donor = abm1_meta$donor
) %>%
ggplot(aes(x = x, fill = donor)) +
geom_bar(position = "dodge") +
ggtitle("Cell types in the 'Allen Brain M1' reference dataset") +
scale_fill_brewer(palette = "Set1") +
scale_y_log10() +
hist_theme +
theme(axis.title.x = element_blank()) +
theme(axis.title.y = element_blank())
These marker genes were extracted from the Allen Brain M1 scRNA dataset. Technically, these should be the same as in abm1_data before filtering. Just in case, we subset them the maximal common subset.
abm1_markergenes <-
path_to("data/*/*AllenBrain-M1/b*/marker_genes.csv") %>%
from_csv() %>%
rownames() %>%
{
(.)[(.) %in% rownames(fgcz_data)]
} %>%
{
(.)[(.) %in% rownames(abm1_data)]
}
stopifnot(117 == length(abm1_markergenes))
In this section we use Darmanis et al as the reference dataset.
Repackage the data following the Bisque vignette.
fgcz_eset <- Biobase::ExpressionSet(
# Subset to marker genes for residual norm computation
assayData = as.matrix(fgcz_data[abm1_markergenes, ])
)
darm_eset <- Biobase::ExpressionSet(
# Expression data
assayData = as.matrix(darm_data),
# Metadata
phenoData = Biobase::AnnotatedDataFrame(
data = data.frame(
row.names = rownames(darm_meta),
cellType = darm_meta$celltype,
SubjectName = darm_meta$experiment_sample_name,
check.names = FALSE,
check.rows = FALSE,
stringsAsFactors = FALSE
),
varMetadata = data.frame(
row.names = c("cellType", "SubjectName"),
labelDescription = c("cellType", "SubjectName")
)
)
)
Deconvolution.
bisque_report <-
BisqueRNA::ReferenceBasedDecomposition(
# BULK DATA
bulk.eset = fgcz_eset,
# REFERENCE
sc.eset = darm_eset,
#
use.overlap = FALSE,
markers = abm1_markergenes,
verbose = FALSE
)
# Fields of the Bisque result
bisque_report %>%
summary() %>%
kable()
| Length | Class | Mode | |
|---|---|---|---|
| bulk.props | 612 | -none- | numeric |
| sc.props | 48 | -none- | numeric |
| rnorm | 102 | -none- | numeric |
| genes.used | 109 | -none- | character |
| transformed.bulk | 11118 | -none- | numeric |
Bisque returns proportions that sum to one:
stopifnot(max(abs(1 - colSums(bisque_report$bulk.props))) <= 1e-10)
We undo that using the residual. The meaning of rnorm is not clear from the manual, so this might be incorrect:
explained_fractions <-
bisque_report %>% {
1 - (sqrt((.)$rnorm) / sqrt(colSums(fgcz_data[(.)$genes.used, ]**2)))
}
Rescale proportions to the “explained fractions”.
fgcz_darm_bisque <-
t(t(bisque_report$bulk.props) * explained_fractions)
Cluster bulk samples by composition.
samples_order <-
fgcz_darm_bisque %>%
{
(.)[, hclust(dist(t(.)))$order]
} %>%
colnames()
Save to disk and visualize inferred cell type composition by bulk sample.
save_heat_bars_data <-
function(., prefix = deparse(substitute(.))) {
as <- (function(ext) paste(prefix, ext, sep = "_"))
(.)[order(tolower(rownames(.))), samples_order] %>%
#
save_heatmap(as("heat.png")) %>%
save_barchart(as("bars.png")) %>%
#
as.data.frame() %>%
ind2col("celltype") %>%
to_csv(as("data.csv"))
}
save_heat_bars_data(fgcz_darm_bisque) %>% hush()
The cell type proportions returned by Bisque vary rather more than expected. The samples come from four different hospitals (the field fgcz_meta$Source), which could introduce the strongest batch effect. We can also compare that to the RIN (RNA integrity number). Age could be another important factor.
PC1_and_friends <-
function(.) {
plots <-
prcomp(t(.))$x %>%
as.data.frame() %>%
assert_fgcz_meta_order() %>%
cbind(fgcz_meta) %>%
cbind(y = (2 * rank((.)$PC1) / length((.)$PC1) - 1)) %>%
{
# Make several plots silently
lapply(
c("PC2", "RIN", "Age"),
function(field) {
cbind((.), x = (.)[[field]]) %>% {
ggplot(., aes(x = x, y = y, shape = Condition, color = Source)) +
labs(
x = field,
y = "Deviation from the average composition (~PC1)",
color = "Sample source",
shape = "Condition"
) +
geom_point(size = 5, alpha = 0.2 + abs((.)$y)) +
scale_color_brewer(palette = "Set1")
}
}
)
}
t <- theme(
legend.position = "none",
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
)
filename <- out_file(paste(
deparse(substitute(.)),
paste(lapply(plots, function(plot) plot$labels$x), collapse = "_"),
"vs__PC1.png",
sep = "__"
))
plots <- gridExtra::arrangeGrob(
grobs = c(
# Remove redundant stuff
lapply(plots, function(.) (. + t)),
# but keep one legend
list(cowplot::get_legend(plots[[1]]))
),
# All grobs in one row
nrow = 1,
# Y-axis label
left = grid::grid.text(
plots[[1]]$labels$y,
rot = 90,
draw = FALSE,
gp = grid::gpar(fontsize = ggplot2::theme_get()$text$size)
)
)
ggsave(
filename = filename,
plot = plots,
width = WIDTH1,
height = WIDTH1 / (length(plots) - 1)
)
return(filename)
}
Principal components of cell type proportions, etc.
PC1_and_friends(fgcz_darm_bisque) %>% hush()
# Prevent errors from MuSiC
exprs <- xbioc::exprs
pVar <- xbioc::pVar
music_report.darm <-
MuSiC::music_prop(
bulk.eset = fgcz_eset,
sc.eset = darm_eset,
clusters = darm_meta$celltype,
samples = names(darm_data)
) %>%
suppressMessages()
music_report.darm %>%
summary() %>%
kable()
| Length | Class | Mode | |
|---|---|---|---|
| Est.prop.weighted | 612 | -none- | numeric |
| Est.prop.allgene | 612 | -none- | numeric |
| Weight.gene | 11118 | -none- | numeric |
| r.squared.full | 102 | -none- | numeric |
| Var.prop | 612 | -none- | numeric |
Estimate cell type proportions (celltype x sample).
fgcz_darm_music <-
t(music_report.darm$Est.prop.weighted) %>% {
(.)[rownames(fgcz_darm_bisque), ]
}
save_heat_bars_data(fgcz_darm_music) %>% hush()
PC1_and_friends(fgcz_darm_music) %>% hush()
We would like to know whether there is the cell type composition differs in ALS vs Control samples. First, the following figure shows the distribution of the first principal component (of cell type proportions) separated by sample source.
fgcz_darm_music.pca <-
fgcz_darm_music %>%
{
as.data.frame(prcomp(t(.))$x)
} %>%
assert_fgcz_meta_order() %>%
cbind(fgcz_meta)
fgcz_darm_music.pca %>% {
ggplot(., aes(x = PC1, fill = Source, linetype = Source)) +
labs(fill = "Sample source") +
guides(linetype = FALSE) +
geom_density(
position = "identity", alpha = 0.4
) +
geom_histogram(
aes(y = stat(density) / 5),
position = "dodge", bins = 20, color = 1, linetype = 1
) +
hist_theme +
theme(axis.text.y = element_blank()) +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
) +
scale_fill_brewer(palette = "Set1")
}
The separation by Source seen in the above figure suggests looking at the Condition (ALS vs Control) for each Source individually.
fgcz_darm_music.pca %>%
{
ggplot(., aes(x = PC1, fill = Condition)) +
geom_density(
position = "identity", alpha = 0.4
) +
geom_histogram(
aes(y = stat(density) / 5),
position = "dodge", bins = 20, color = 1, linetype = 1
) +
hist_theme +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank()
) +
scale_fill_brewer(palette = "Set1") +
facet_grid(cols = vars(Source))
} %>%
ggsave(
filename = out_file("fgcz_darm_music_PC1.png"),
width = WIDTH1,
height = WIDTH1 / 4
) %>%
hush()
Among those, samples from Lo show a conspicuous further separation by Condition, supported by the following two-sided two-sample t-test.
fgcz_darm_music.pca %>%
filter(Source != "PD") %>%
group_by(Source) %>%
group_map(~ {
t.test(
x = filter((.x), Condition == "ALS")$PC1,
y = filter((.x), Condition != "ALS")$PC1,
alternative = "two"
) %>% {
list(
`Sample source` = dplyr::pull((.y)),
`p-value` = round((.)$p.value, 3),
`t-statistic` = round((.)$statistic, 2),
`#dof` = round((.)$parameter, 1)
)
}
}) %>%
data.table::rbindlist() %>%
kable()
| Sample source | p-value | t-statistic | #dof |
|---|---|---|---|
| Lo | 0.023 | -2.48 | 18.2 |
| NB | 0.672 | -0.43 | 17.0 |
| Ox | 0.616 | -0.51 | 21.6 |
Ordering the samples by the first principal component reveals its meaning as a measure of the content of astrocytes vs neurons. The samples from Lo suggest that high proportion of neurons is associated with ALS.
fgcz_darm_music %>%
t() %>%
as.data.frame() %>%
assert_fgcz_meta_order() %>%
cbind(Source = fgcz_meta$Source) %>%
cbind(PC1 = fgcz_darm_music.pca$PC1) %>%
tibble::rownames_to_column("ID") %>%
mutate(ID = paste(ID, if_else(fgcz_meta$Condition == "ALS", "(ALS)", ""))) %>%
arrange(PC1) %>%
select(-PC1) %>%
group_by(Source) %>%
group_map(~ {
(.x) %>%
tibble::column_to_rownames("ID") %>%
t() %>%
save_barchart(paste("fgcz_darm_music_", pull(.y), "_bars.png", sep = ""))
}) %>%
hush()
Lo:
NB:
Ox:
PD:
For the code see here.
fgcz_darm_nnls <-
path_to("code/*/*-NNLS_Darmanis/a_*/celltypes.csv") %>%
# The first column is not unique
# Keep header as is, e.g. "H-0C083"
utils::read.delim(., check.names = F) %>%
# `cell type` ~> `celltype`
fix_names() %>%
# Collapse cell types:
group_by_and_sum("celltype")
save_heat_bars_data(fgcz_darm_nnls) %>% hush()
PC1_and_friends(fgcz_darm_nnls) %>% hush()
These results are similar enough to MuSiC’s to warrant a closer look. For each sample we compute the cosine similarity between the two methods and plot a histogram of those separated by sample source.
similarity_plot <- function(X1, X2) {
X1 <- X1[order(rownames(X1)), order(colnames(X1))]
X2 <- X2[order(rownames(X2)), order(colnames(X2))]
# Check that samples are arranged consistently
stopifnot(all(colnames(X1) == colnames(X2)))
stopifnot(all(rownames(X1) == rownames(X2)))
colSums(col_norm2(X1) * col_norm2(X2)) %>%
data.frame(
x = .,
s = fgcz_meta$Source
) %>%
{
ggplot(., aes(x = x, fill = s, linetype = s)) +
labs(fill = "Sample source") +
guides(linetype = FALSE) +
geom_density(
position = "identity", alpha = 0.4
) +
geom_histogram(
aes(y = stat(density) / 5),
position = "dodge", bins = 30, color = 1, linetype = 1,
) +
hist_theme +
theme(axis.text.y = element_blank()) +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
) +
scale_fill_brewer(palette = "Set1")
}
}
similarity_plot(fgcz_darm_music, fgcz_darm_nnls) +
labs(x = "Cosine similarity of a sample (MuSiC vs NNLS)")
For the code see here.
fgcz_darm_sieve <-
path_to("code/*/*-RNASieve/a_*/celltypes.csv") %>%
from_csv() %>%
fix_names()
save_heat_bars_data(fgcz_darm_sieve) %>% hush()
PC1_and_friends(fgcz_darm_sieve) %>% hush()
similarity_plot(fgcz_darm_music, fgcz_darm_sieve) +
labs(x = "Cosine similarity of a sample (MuSiC vs RNA-Sieve)")
This preprocesses the reference dataset.
# This takes some 20min
prediction <- ROCR::prediction
performance <- ROCR::performance
zlm <- MAST::zlm
dwls_darm_signature <-
DWLS::buildSignatureMatrixMAST(
scdata = darm_data,
id = setNames(as.list(darm_meta[["celltype"]]), rownames(darm_meta)),
path = out_file("dwls_darm_signature/")
)
This applies deconvolution to each bulk sample.
fgcz_darm_dwls <-
fgcz_data %>%
{
genes <- intersect(rownames(dwls_darm_signature), rownames(.))
pbapply::pblapply(
(.)[genes, ],
function(sample) {
# Silence silly printout
purrr::quietly(DWLS::solveDampenedWLS)(
as.matrix(dwls_darm_signature[genes, ]), as.matrix(sample)
)$result
},
# Parallel processes -- requires (cl x RAM)
cl = 1
)
} %>%
as.data.frame(check.names = F)
fgcz_darm_dwls <-
fgcz_darm_dwls %>%
{fgcz_darm_dwls[order(tolower(rownames(.))), ]}
fgcz_darm_dwls %>%
save_heat_bars_data("fgcz_darm_dwls") %>%
hush()
similarity_plot(fgcz_darm_music, fgcz_darm_dwls) +
labs(x = "Cosine similarity of a sample (MuSiC vs DWLS)")
In this section we use Allen Brain M1 as the reference dataset.
Repackage:
abm1_eset <- Biobase::ExpressionSet(
# Expression data
assayData = as.matrix(abm1_data),
# Metadata
phenoData = Biobase::AnnotatedDataFrame(
data = data.frame(
row.names = rownames(abm1_meta),
cellType = abm1_meta$celltype,
SubjectName = abm1_meta$donor,
check.names = FALSE,
check.rows = FALSE
),
varMetadata = data.frame(
row.names = c("cellType", "SubjectName"),
labelDescription = c("cellType", "SubjectName")
)
)
)
Deconvolution:
bisque_report <-
BisqueRNA::ReferenceBasedDecomposition(
# BULK DATA
bulk.eset = fgcz_eset,
# REFERENCE
sc.eset = abm1_eset,
#
use.overlap = FALSE,
markers = abm1_markergenes,
verbose = FALSE
)
## Warning in BisqueRNA::ReferenceBasedDecomposition(bulk.eset = fgcz_eset, :
## Only two individuals detected in single-cell data. While Bisque will run, we
## recommend at least three subjects for reliable performance.
bisque_report$bulk.props %>%
save_heat_bars_data("fgcz_abm1_bisque") %>%
hush()
# Read from disk if already available
# otherwise it takes ages
if (!file.exists(out_file("fgcz_abm1_music_data.csv"))) {
# MuSiC with ABM1
music_report.abm1 <-
MuSiC::music_prop(
bulk.eset = fgcz_eset,
sc.eset = abm1_eset,
clusters = abm1_meta$celltype,
samples = names(abm1_data)
) %>%
suppressMessages()
fgcz_abm1_music <-
t(music_report.abm1$Est.prop.weighted)
} else {
fgcz_abm1_music <-
from_csv(out_file("fgcz_abm1_music_data.csv"))
}
fgcz_abm1_music %>%
save_heat_bars_data("fgcz_abm1_music") %>%
hush()
(in progress)
(in progress)
RA, Cell type deconvolution from bulk RNA-seq of the human brain, 2020-12-29, http://bit.ly/deco_ra.